server : support unified cache across slots #16736
Conversation
src/llama-context.cpp (outdated)

```diff
 uint32_t llama_context::n_ctx_per_seq() const {
-    return cparams.n_ctx / cparams.n_seq_max;
+    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
```
Should this value be capped when using unified cache to avoid exceeding the model context length? I think it could be set to min(n_ctx_train, n_ctx), or add a parameter to allow the user to change it.
I guess we can cap it to n_ctx_train. The only use case for n_ctx > n_ctx_train that comes to mind is self-extend, but lately this technique seems less relevant.
We can also cap it for the non-unified case?
Suggested change:

```diff
-    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
+    return std::min(n_ctx_train, cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max);
```
> We can also cap it for the non-unified case?
What would happen to the leftover slots? I may be misunderstanding the way split cache works, but my assumption would be that these slots would never be used, and it would be wasted memory. So if that's capped, it should be done at context creation.
Right, we should do the capping at context creation in the llama_context constructor. Currently we have some additional logic for this in llama-model:
llama-model, lines 19708 to 19724 in 7863fcc:

```cpp
const auto padding = llama_kv_cache::get_padding(cparams);

uint32_t n_ctx_per_stream = cparams.n_ctx;

if (!cparams.kv_unified) {
    n_ctx_per_stream = (cparams.n_ctx + cparams.n_seq_max - 1)/cparams.n_seq_max;
    n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

    cparams.n_ctx = n_ctx_per_stream*cparams.n_seq_max;
} else {
    n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

    cparams.n_ctx = n_ctx_per_stream;
}

LLAMA_LOG_DEBUG("%s: n_ctx = %u (padded)\n", __func__, cparams.n_ctx);
```
Since we no longer need the padding logic (as of #16148 and related) we should simplify this.
I'll push a separate PR for this and then will come back to polishing this one.
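For reference, a minimal sketch of how that block might look once the padding calls are dropped, using the same variable names as the snippet above. This is an assumption about the shape of the simplification, not the final code from the separate PR:

```cpp
// sketch only: same logic as above with the GGML_PAD/padding handling removed,
// since the KV cache no longer requires padded sizes
uint32_t n_ctx_per_stream = cparams.n_ctx;

if (!cparams.kv_unified) {
    // split the total context evenly across the parallel sequences (rounding up)
    n_ctx_per_stream = (cparams.n_ctx + cparams.n_seq_max - 1)/cparams.n_seq_max;

    cparams.n_ctx = n_ctx_per_stream*cparams.n_seq_max;
}
```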
This is now rebased on top of the changes in #16812. The result is that we determine the KV cache size during context creation and there should be no leftover KV cells.
Note that since we now cap the context size to the training context size, user code is recommended to query llama_n_ctx and llama_n_ctx_seq after creating the llama_context in order to obtain the actual context sizes. I'll add comments in llama.h to reflect this.
I'll try to clean up this PR next and will open it for review when ready.
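As an illustration, a minimal sketch of the recommended pattern, assuming the newer llama.cpp C API names (llama_model_load_from_file, llama_init_from_model) and that the llama_n_ctx_seq getter mirrors the existing llama_n_ctx; the model path and parameter values are placeholders:

```cpp
#include <cstdio>

#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // placeholder path
    if (model == NULL) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx     = 0; // 0 = take the context size from the model
    cparams.n_seq_max = 4;

    llama_context * ctx = llama_init_from_model(model, cparams);

    // the requested sizes may have been adjusted (e.g. capped to the training context),
    // so query the effective values after creation:
    printf("n_ctx     = %u\n", llama_n_ctx(ctx));
    printf("n_ctx_seq = %u\n", llama_n_ctx_seq(ctx));

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();

    return 0;
}
```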
```cpp
if (cparams.n_ctx_seq > hparams.n_ctx_train) {
    LLAMA_LOG_WARN("%s: n_ctx_seq (%u) > n_ctx_train (%u) -- possible training context overflow\n",
            __func__, cparams.n_ctx_seq, hparams.n_ctx_train);
```
This branch should not be reached due to the capping above on line 117, but I'm keeping it in case the capping logic changes in the future.
Ready for review. I've marked some TODOs for follow-up PRs, since I think the current implementation is quite basic while getting us 90% of the way to the ideal logic. The remaining cases will be improved in the follow-ups.
```cpp
if (cparams.kv_unified) {
    cparams.n_ctx_seq = cparams.n_ctx;
} else {
    cparams.n_ctx_seq = cparams.n_ctx / cparams.n_seq_max;
}

if (cparams.n_ctx_seq > hparams.n_ctx_train) {
    LLAMA_LOG_WARN("%s: capping n_ctx_seq (%u) to n_ctx_train (%u)\n", __func__, cparams.n_ctx_seq, hparams.n_ctx_train);

    cparams.n_ctx_seq = hparams.n_ctx_train;
}

if (cparams.kv_unified) {
    cparams.n_ctx = cparams.n_ctx_seq;
} else {
    cparams.n_ctx = cparams.n_ctx_seq * cparams.n_seq_max;
}
```
I am not completely convinced about this; I think it may create confusion and add complexity to applications. The server and other applications using the unified cache need a sequence length limit independent of n_ctx, but that should probably be a different parameter that defaults to min(n_ctx, n_ctx_train). This would be an application parameter, not part of the llama.cpp API.
Sounds good. Just remove the capping here?
Yes, I think it would be preferable to not have a limit here. The user should be able to override the model n_ctx_train, and it is easier to do it this way than with a KV override.
Moved the capping logic to the llama-server.
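For illustration, a minimal sketch of what such application-side capping could look like, assuming the llama_model_n_ctx_train getter from llama.h and the llama_n_ctx_seq getter discussed above; the helper name is made up here:

```cpp
#include <algorithm>
#include <cstdint>

#include "llama.h"

// sketch: compute the per-slot context limit on the application (server) side,
// capping the library-reported per-sequence context to the model's training context;
// this function is illustrative and not part of llama.cpp
static uint32_t slot_ctx_limit(const llama_model * model, const llama_context * ctx) {
    const uint32_t n_ctx_train = (uint32_t) llama_model_n_ctx_train(model);
    const uint32_t n_ctx_seq   = llama_n_ctx_seq(ctx); // per-sequence context from the library

    return std::min(n_ctx_seq, n_ctx_train);
}
```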
```cpp
if (params.n_parallel == 1 && params.kv_unified == false) {
    LOG_WRN("%s: setting n_parallel = 4 and kv_unified = true\n", __func__);

    params.n_parallel = 4;
    params.kv_unified = true;
}
```
Is there a reason why this can't be default params in arg.h?
I'll see if I can make it the default - I thought that some of the examples might not like it.
Hmm, yeah, I didn't notice that there are multiple examples all using n_parallel.
In this case, maybe we can use a dedicated variable for the server, like params.n_parallel_server?
This can be useful when auto-generating the documentation for the server args.
```cpp
    n_batch /= 2;
}

SRV_WRN("failed to find free space in the KV cache, retrying with smaller batch size, i = %d, n_batch = %d, ret = %d\n", i, n_batch, ret);
```
this warning should be moved inside the if condition above, right?
Also, maybe I forgot this from a discussion before, but in which case do we currently need to retry with a smaller batch size?
The main case for retrying with smaller batches was back when we didn't have ggml_set_rows and we always had to search for a contiguous set of cells (KV slots) inside the cache buffer to place the input batch. Now with ggml_set_rows this is no longer needed, and technically retrying with a smaller batch size has almost no purpose except in some rare cases.
But generally, when llama_decode returns 1, you should retry with a smaller batch.
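To make that convention concrete, here is a simplified sketch of the retry pattern (not the server's exact code; building and re-splitting the batch is omitted):

```cpp
#include <cstdint>

#include "llama.h"

// sketch: llama_decode returns 0 on success, a negative value on a fatal error,
// and a positive value (e.g. 1) when it could not find space in the KV cache;
// in the latter case the caller may retry with a smaller batch
static int32_t decode_with_retry(llama_context * ctx, llama_batch batch, int32_t n_batch) {
    while (true) {
        const int32_t ret = llama_decode(ctx, batch);

        if (ret <= 0) {
            return ret; // success or fatal error - nothing more to do here
        }

        if (n_batch <= 1) {
            return ret; // cannot shrink the batch any further
        }

        n_batch /= 2;
        // ... re-split the input into chunks of at most n_batch tokens and retry ...
    }
}
```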
Hmm, this could be a bit of a bummer in terms of UX. For example, this case:
An idea for improvement could be to only allow starting a task when the remaining context passes a threshold (maybe more than half free?), and otherwise defer the task. (Of course, we can implement this in a follow-up PR.)
Yes, there are several edge cases that can be handled better. This specific case can probably be handled even better: when the total context gets filled up, move one active sequence to the host-memory cache and resume it later when another sequence finishes. This way we don't need special thresholds, and both sequences would eventually finish (after a short pause for one of them, of course).
src/llama-context.cpp (outdated)

```cpp
if (cparams.kv_unified) {
    cparams.n_ctx_seq = cparams.n_ctx;
} else {
    cparams.n_ctx_seq = cparams.n_ctx / cparams.n_seq_max;
```
Maybe an error could be returned here if n_ctx is not a multiple of n_seq_max, since that's likely to be a mistake.
I added a warning. The problem I see with throwing an error is that the user might often want to use the default training context split among, for example, 3 sequences. And in the majority of cases the training context (typically a power of 2) would not be divisible by 3, resulting in an error.
Hi, by default there is a unified KV cache with 4 slots, but setting only "--parallel 1" still uses 4 slots with the unified KV cache. Is it correct that "--parallel 1" doesn't work as expected (it should use 1 slot) without "--kv-unified"? Full command:
ref #4130 (reply in thread)
Current logic in this PR (subject to change):
- With -kvu, share the entire context (-c N) among all parallel slots of the server (-np N)
- The -np N argument is still utilized to control the max number of parallel jobs, but it is no longer used to change the per-slot context

Example:
TODO:
- Think about moving the slot into the host-memory cache instead of purging it. Not sure that this is really needed thanks to the existing logic from server : host-memory prompt caching #16391

Future improvements: